Comparing the performance of oversampling techniques in combination with a clustering algorithm for imbalanced learning
Dissertation presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Knowledge Management and Business Intelligence.
Imbalanced datasets in supervised learning remain an ongoing challenge for standard
algorithms, which are designed to handle balanced class distributions and perform poorly
when applied to imbalanced problems. Many methods have been developed to address
this problem, but the more general approach to achieving a balanced class distribution is
data-level modification rather than algorithm modification. Although class imbalances are responsible for
significant losses of performance in standard classifiers in many different types of problems, another
aspect that is important to consider is the small disjuncts problem. Therefore, it is important to
consider and understand solutions that take into account not only the between-class imbalance
(the imbalance occurring between the two classes) but also the within-class imbalance (the imbalance
occurring between the sub-clusters of each class) and to oversample the dataset by rectifying these
two types of imbalances simultaneously. It has been shown that cluster-based oversampling is a robust
solution that takes into consideration these two problems. This work sets out to study the effect and
impact of combining different existing oversampling methods with a clustering-based approach.
Empirical results of extensive experiments show that combining different oversampling
techniques with the k-means clustering algorithm (K-Means Oversampling) improves upon
the classification results obtained from the oversampling techniques alone, with no prior clustering step.
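
The idea of rectifying between-class and within-class imbalance together can be illustrated with a minimal sketch. It is not the dissertation's implementation, and it rests on several assumptions made only for illustration: the minority class alone is clustered, each non-empty cluster receives an equal share of the majority-class count, and synthetic points come from a simplified SMOTE-like interpolation between two random members of the same cluster (no nearest-neighbour search). The names kmeans and cluster_oversample are hypothetical.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on a list of equal-length tuples.

    Returns one cluster label per point.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels


def cluster_oversample(minority, n_majority, k=2, seed=0):
    """Grow the minority class towards n_majority samples.

    Assumption (illustrative only): every non-empty cluster is filled up to
    the same target size, so small sub-clusters receive more synthetic points.
    """
    rng = random.Random(seed)
    labels = kmeans(minority, k, seed=seed)
    clusters = [[p for p, lab in zip(minority, labels) if lab == c]
                for c in range(k)]
    clusters = [members for members in clusters if members]
    target = n_majority // len(clusters)  # per-cluster size after oversampling
    result = list(minority)  # original samples are kept unchanged
    for members in clusters:
        for _ in range(max(0, target - len(members))):
            # Simplified SMOTE-like step: interpolate between two random
            # members of the same cluster.
            a, b = rng.choice(members), rng.choice(members)
            t = rng.random()
            result.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return result


# Toy minority class with one dense and one sparse sub-cluster.
minority = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
            (0.0, 2.0), (2.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
balanced = cluster_oversample(minority, n_majority=20, k=2)
print(len(minority), "->", len(balanced))
```

In this sketch the sparse sub-cluster is oversampled more heavily than the dense one, which is the within-class correction, while the overall minority count is raised towards the majority count, which is the between-class correction. A real experiment would replace the interpolation step with one of the existing oversampling methods under comparison.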